Analysis of Wine Qualities by Igor Stojanovic

Introduction

The author first looked at both wine data sets provided by Udacity. A quick look showed that the data sets are very similar. Instead of analysing only one of them, the author decided to combine both into one aggregated data set.

The author has always had a strong interest in wine, red or white. An aggregated data set allows the author to explore both types of wine and to see whether they differ. Although the author likes to taste and discuss wine, he has never had the opportunity to analyse an aggregated data set with this many laboratory measurements.

The most interesting overall question is: What aspects define a high quality wine? But the data set offers even more potential for exploration. The author will try to uncover hidden relationships among the variables, as well as differences between the two types of wine. Finally, the author will try to build a predictive model to forecast the quality of a wine.

Both data sets were obtained from Cortez et al. (2009).

Dataset

Since variable 1 (row number) was not needed, it was dropped from further analysis. In addition, the author added to both data sets a new variable called “type”, which contains either the value “red” or “white”, depending on the data set of origin. The final data set for the analysis contains the following variables:

1 - fixed acidity (tartaric acid - g / dm^3)

2 - volatile acidity (acetic acid - g / dm^3)

3 - citric acid (g / dm^3)

4 - residual sugar (g / dm^3)

5 - chlorides (sodium chloride - g / dm^3)

6 - free sulfur dioxide (mg / dm^3)

7 - total sulfur dioxide (mg / dm^3)

8 - density (g / cm^3)

9 - pH

10 - sulphates (potassium sulphate - g / dm^3)

11 - alcohol (% by volume)

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

13 - type (red or white)
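The preparation described before the variable list (dropping the row number, tagging each set with its type and stacking both sets) could be sketched roughly as follows. The two small data frames are made-up stand-ins for the real files; the names df_red, df_white, df_wine and the row number column X follow the names used later in this report, and the actual code of the author may differ.

```r
# Toy stand-ins for the two source files (values are made up for illustration).
df_red <- data.frame(X = 1:2, alcohol = c(9.4, 9.8), quality = c(5L, 5L))
df_white <- data.frame(X = 1:3, alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 6L))

# Drop the row number column, which carries no information.
df_red$X <- NULL
df_white$X <- NULL

# Tag each set with its origin before stacking.
df_red$type <- "red"
df_white$type <- "white"

# Stack both sets; rbind() needs identical column names in both data frames.
df_wine <- rbind(df_red, df_white)
str(df_wine)
```

Adding the type column before stacking keeps the origin of every row recoverable after the merge.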

Description of variables

Description of variable (Cortez et al., 2009):

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

13 - type: either red or white wine.

Univariate Plots Section

## [1] "dimension of red wine is : 1599 observations and 13 variables"
## [1] "dimension of white wine is : 4898 observations and 13 variables"
## [1] "dimension of complete data set is : 6497 observations and 13 variables"
## 'data.frame':    6497 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ type                : chr  "red" "red" "red" "red" ...

All of the variables have the desired class. Therefore, no further data wrangling is needed at this point.

## df_wine$type: red
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality          type          
##  Min.   :3.000   Length:1599       
##  1st Qu.:5.000   Class :character  
##  Median :6.000   Mode  :character  
##  Mean   :5.636                     
##  3rd Qu.:6.000                     
##  Max.   :8.000                     
## -------------------------------------------------------- 
## df_wine$type: white
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality          type          
##  Min.   :3.000   Length:4898       
##  1st Qu.:5.000   Class :character  
##  Median :6.000   Mode  :character  
##  Mean   :5.878                     
##  3rd Qu.:6.000                     
##  Max.   :9.000

The first quick summary statistics, grouped by wine type (red or white), indicate interesting differences. On average, red wine seems to contain higher levels of fixed and volatile acidity, while white wine shows higher levels of citric acid. White wine seems to have more residual sugar and slightly more alcohol, but a lower density. These variables might be related to each other, which will be explored later. The data also indicates that white wine has a higher quality on average than red wine.
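A grouped summary like the one above can be produced with base R's by(), which the "df_wine$type: red" headers in the output suggest. The sketch below uses a small made-up data frame in place of the real df_wine, so the printed numbers are illustrative only.

```r
# Toy data (made-up values) standing in for the combined data set.
df_wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 20.7, 1.6, 8.1),
  quality = c(5L, 5L, 6L, 6L, 6L, 7L),
  type = c("red", "red", "red", "white", "white", "white")
)

# One summary() per wine type; by() splits the rows on the grouping vector.
by(df_wine[, c("residual.sugar", "quality")], df_wine$type, summary)

# Group means only, for a compact comparison of the two types.
mean_quality <- tapply(df_wine$quality, df_wine$type, mean)
mean_quality
```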

## df_wine$type: red
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000 
## -------------------------------------------------------- 
## df_wine$type: white
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## 
##    3    4    5    6    7    8    9 
##   30  216 2138 2836 1079  193    5

The first chart reveals that the quality distributions of both wine types are slightly skewed to the right. In line with the summary statistics, the maximum quality is 9 for white wine and 8 for red wine, but a rating of 9 was reached only very rarely. The boxplot shows these data points as outliers, but they are kept since they are valid data points. The bin width further showed that all values are integers. The boxplot indicates that white wine has a slightly higher mean quality than red wine. Finally, the bar plot reveals that there are many more white wines (approx. 4900) than red wines (approx. 1600) in the data set. It would be interesting to look at this variable in relation to acidity, residual sugar, pH and alcohol.
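To put the frequency table of quality scores into perspective, the counts can be converted into shares and an overall mean. The counts below are copied from the table in this report; the variable names are only illustrative.

```r
# Counts per quality score, copied from the frequency table in this report.
scores <- 3:9
counts <- c(30, 216, 2138, 2836, 1079, 193, 5)
names(counts) <- scores

total <- sum(counts)                           # number of wines in the table
share <- round(100 * counts / total, 2)        # percentage of wines per score
overall_mean <- sum(scores * counts) / total   # mean quality over both types

total
share
overall_mean
```

With these counts, the scores 5 and 6 together account for roughly three quarters of all wines, while a score of 9 occurs in well under 1% of the cases.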

Acidity variables

## df_wine$type: red
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010 
## -------------------------------------------------------- 
## df_wine$type: white
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The pH plot shows a right-skewed distribution of the pH values, independent of the wine type. Most data points lie between 3 and 3.5, while some reach slightly lower (min. 2.72) or higher (max. 4.01) values. All data points seem reasonable and indicate that all wines in the data set are acidic rather than basic. Let’s see what the distribution of fixed acidity looks like.

## df_wine$type: red
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90 
## -------------------------------------------------------- 
## df_wine$type: white
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Fixed acidity follows the distribution pattern of the other variables and is therefore also skewed to the right. Most values lie between 4 and 8. The boxplot reveals that white wine normally has lower levels of fixed acidity, whereas red wine reaches higher levels. Even though there are some outliers, they still seem to be valid data points and are therefore not dropped from the data set. What pattern can we find in volatile acidity?

It is interesting to find the same pattern in volatile and fixed acidity. The histogram reveals that the variable is skewed to the right, whereas the boxplot reveals that red wine tends to reach higher levels of volatile acidity than white wine. The boxplot for white wine shows many outliers, which could be removed. Nevertheless, we will keep these data points for further analysis. Let’s have a look at the last type of acidity: citric acid.
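The points that a boxplot draws as outliers are, by default in R, the values lying more than 1.5 times the interquartile range beyond the hinges. The following sketch shows how such points could be listed before deciding whether to keep them; the values are made up, with the last one deliberately extreme.

```r
# Made-up volatile acidity values; the last one is deliberately extreme.
x <- c(0.21, 0.24, 0.26, 0.27, 0.28, 0.30, 0.32, 0.34, 1.10)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses.
flagged <- boxplot.stats(x)$out
flagged

# The same rule written out by hand, using the hinges from fivenum().
h <- fivenum(x)                      # min, lower hinge, median, upper hinge, max
upper_fence <- h[4] + 1.5 * (h[4] - h[2])
lower_fence <- h[2] - 1.5 * (h[4] - h[2])
manual <- x[x < lower_fence | x > upper_fence]
manual
```

Listing the flagged values first makes it easier to judge whether they are plausible measurements, as was done here, or entry errors.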

Interestingly, this variable looks more normally distributed. Even more surprisingly, white wine tends to have higher levels of citric acid than red wine. The description of the variable may reveal the reason why:

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

Red wine is usually not famous for its “refreshing” taste. This might be the reason why white wine contains higher levels of citric acid. Here as well, the outliers look like valid data points and are kept in the data set for further analysis.

Summary Univariate Acidity Plots

The pH chart revealed some very interesting insights. It shows that all wines in the data set, independent of type, have pH values between ~2.75 and ~4.0. This means that all wines in the sample are acidic rather than basic. The author wonders whether this is the case for all wines or just a coincidence, but this question is not answered within this analysis. The chart further reveals that the distribution is very similar for both wine types in this regard.

The other acidity variables look mostly normally distributed or slightly skewed to the right. In general, red wine has higher levels of acidity, except for citric acid. Most values of fixed acidity lie between 4 and 12, most values of volatile acidity between 0.1 and 1, and most values of citric acid between 0 and 0.75.

Distribution of Residual Sugar and Alcohol

Let’s move on from the acidity variables and take a deeper look at the sweetness and alcohol content of the wines.

##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 3253           7.9            0.330        0.28           31.6     0.053
## 3263           7.9            0.330        0.28           31.6     0.053
## 4381           7.8            0.965        0.60           65.8     0.074
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 3253                  35                  176 1.01030 3.15      0.38
## 3263                  35                  176 1.01030 3.15      0.38
## 4381                   8                  160 1.03898 3.39      0.69
##      alcohol quality  type
## 3253     8.8       6 white
## 3263     8.8       6 white
## 4381    11.7       6 white
## df_wine$type: red
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500 
## -------------------------------------------------------- 
## df_wine$type: white
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.379   9.900  31.600

The first chart shows that values between 1 and 20 are very frequent for residual sugar, with one outlier over 60. This might be due to an entry error, as this data point lies very far from the others. The second chart indicates that most wines have residual sugar in the range of 1 to 3; nevertheless, white wines tend to have higher values more frequently than red wines. This is even more evident in the boxplot. Let’s jump to alcohol.

Note: Remove data points over 60 for further analysis.
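Removing the residual sugar outlier can be done with a simple row filter. The sketch below assumes a data frame named df_wine as in the rest of this report; the four rows are toy values, two of which mirror the residual sugar values 31.6 and 65.8 shown in the output above.

```r
# Toy rows; 31.6 and 65.8 mirror residual sugar values shown in this report.
df_wine <- data.frame(
  residual.sugar = c(1.9, 5.2, 31.6, 65.8),
  type = c("red", "white", "white", "white")
)

# Keep only rows with residual sugar of at most 60.
df_wine <- subset(df_wine, residual.sugar <= 60)

nrow(df_wine)
max(df_wine$residual.sugar)
```

Filtering on the threshold rather than on a row index keeps the step valid even if the row order of the data changes.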

The chart reveals a right-skewed distribution of alcohol. Most wines have an alcohol percentage of 9 - 13%, whereas only a few fall below 8.4% and only a few contain more than 13% alcohol. This trend is similar for red and white wine.

Univariate Analysis Summary

What is the structure of your dataset?

The author created 3 data sets:

  • df_red, containing only red wine samples (1599 observations, 13 variables)

  • df_white, containing only white wine samples (4898 observations, 13 variables)

  • df_wine, the aggregated data set with both samples (6497 observations, 13 variables)

What is/are the main feature(s) of interest in your dataset?

The author would argue that there are 2 main features of interest in the data set. The ultimate interest lies in the quality variable: the author would like to determine which factors influence the quality of a wine and which do not. Furthermore, the author would like to explore the differences between the two wine samples and test whether the higher quality scores of white wine are statistically significant.
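One possible way to test whether white wine scores higher than red wine is a one-sided Welch t-test. This is only a sketch under assumptions: the report does not state which test is used, quality is an ordinal score (so a rank-based test would be an equally reasonable choice), and the data below is made up.

```r
# Made-up quality scores for illustration only.
toy <- data.frame(
  quality = c(5, 5, 6, 5, 6, 4, 5, 6,   # red
              6, 6, 7, 5, 6, 7, 6, 8),  # white
  type = rep(c("red", "white"), each = 8)
)

# With a formula, groups are ordered alphabetically (red, white), so
# alternative = "less" tests whether mean(red) - mean(white) < 0.
res <- t.test(quality ~ type, data = toy, alternative = "less")
res$statistic
res$p.value
```

A small p-value would indicate that the lower mean quality of red wine is unlikely to be due to chance alone, under the assumptions of the test.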

What other features in the dataset do you think will help support your

The author would like to explore the quality variable, especially in relation to acidity (all variables), sweetness (residual sugar) and alcohol.

Did you create any new variables from existing variables in the dataset?

The author created the variable “type” in the aggregated data set and removed the original row number variable “X”. For now, this is everything. In the further course of the analysis, the author might create other variables, e.g. by cutting and labelling the fixed acidity variable.
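Cutting and labelling a numeric variable, as considered above for fixed acidity, can be done with cut(). The break points and labels in this sketch are hypothetical and chosen only for illustration; they are not taken from the analysis.

```r
# A few fixed acidity values within the range reported for this data set.
fixed.acidity <- c(4.6, 6.8, 7.9, 9.2, 15.9)

# Hypothetical break points and labels; intervals are right-closed, e.g. (6, 8].
acidity_level <- cut(
  fixed.acidity,
  breaks = c(0, 6, 8, 10, Inf),
  labels = c("low", "medium", "high", "very high")
)

table(acidity_level)
```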

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Residual sugar had a very unusual outlier with a value over 60. This data point was removed for the further analysis.

Bivariate Plots Section

The pair plot showed some interesting insights. Our data set does not show any high correlations with quality. The highest correlation that can be observed with quality is that of alcohol (corr 0.444). The combined data set further highlights weaker correlations of quality with density (corr -0.312), volatile acidity (corr -0.266) and chlorides (corr -0.201). All other variables seem to have a very weak correlation with quality (between -0.2 and 0.2). Leaving quality aside, the highest correlations in the data set are found between total and free sulfur dioxide (0.721) and between alcohol and density (-0.701).

Interestingly, subsetting the data by wine type reveals different insights. The quality of white wine seems to be related mainly to volatile acidity (corr -0.195), chlorides (corr -0.21), density (corr -0.307) and alcohol (corr 0.436). The quality of red wine also seems to be related to volatile acidity (corr -0.391), alcohol (corr 0.476) and density (corr -0.175), but in addition to citric acid (corr 0.226), total sulfur dioxide (corr -0.185) and sulphates (corr 0.251).

The correlation pair plot seems to indicate that wine type adds variance to the quality of a wine. However, all correlations tend to be weak rather than strong, and alcohol has the highest correlation with quality in all data sets.
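Correlations for the combined data and for each wine type separately can be computed with cor() after splitting the data by type. The sketch assumes Pearson correlation (the default of cor(); the report does not state the method) and uses made-up values, so the resulting coefficients are not those of the real data.

```r
# Made-up values standing in for the combined data set.
df_wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 8.8, 9.5, 10.1, 12.0),
  quality = c(5, 5, 6, 7, 5, 6, 6, 7),
  type = rep(c("red", "white"), each = 4)
)

# Correlation in the combined data.
r_all <- cor(df_wine$alcohol, df_wine$quality)

# Correlation within each wine type.
r_by_type <- sapply(
  split(df_wine, df_wine$type),
  function(d) cor(d$alcohol, d$quality)
)

r_all
r_by_type
```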

Variables with higher correlation in combined data set

Let’s explore the variables that have higher levels of correlation with quality in the combined data set.

The charts above show the relationship between quality and each of the variables with a higher correlation with quality. The jittered points show that the correlations tend to be rather small and not linear. Nevertheless, we can notice that the alcohol level decreases from quality 3 to 5 and then starts to increase with higher quality. Density, chlorides and volatile acidity show a lot of variance across the quality levels, but overall it seems that wines with higher quality ratings are less dense and have lower levels of chlorides and volatile acidity.

Overall, it is surprising that alcohol is the only variable to show a reasonably clear relationship with quality. Even more interesting, alcohol seems to be the main driver of quality compared with the other variables.

Other variables of interest regarding quality in combined data set

Let’s have a look at how residual sugar and pH level might influence the quality of a wine. One could think that the level of sweetness or sourness might influence the quality of a wine.

These charts further indicate that there is no direct relationship between the sweetness or pH level of a wine and its quality, which is in line with the correlation levels. The results have been quite surprising so far, indicating that there is no clear factor (except for alcohol) that raises the quality of a wine. A reason might be that individual tastes differ too much from each other, and a further subsetting of the data by test group (irregular, regular, experts, etc.) may be needed to reveal a clear pattern.

Relationships between supporting variables in combined data set

Since we are not able to find clear relationships with the quality variable, let’s investigate other relationships in the data set. It is interesting that the highest correlations in the combined data set are mostly related to the density of a wine. Residual sugar and fixed acidity show a positive correlation with density, while alcohol shows a negative correlation with density. We could probably try to build a model to predict the density of a wine, but this aspect is not of major interest for us. The relationship between volatile acidity and total sulfur dioxide also looks roughly linear, but is not of major interest either.

Let’s have a look at the relationships between pH and citric acid, sulphates and chlorides, and total sulfur dioxide and free sulfur dioxide. These correlations were among the highest found in the data set.

Citric acid tends to influence the pH value of a wine. It seems that higher levels of citric acid lead to a more acidic wine overall (lower pH values). The strongest relationship can be observed between total and free sulfur dioxide; it seems to be the strongest in the whole data set. But since none of these variables correlate highly with the quality of a wine, they are not of major interest.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

To sum up, no strong correlations with quality were found in the data set. The highest correlation was found between alcohol and the quality of wine, independent of the type of wine. Besides alcohol, the density of a wine seems to have the strongest influence on quality. It was interesting to see that the separated wine types showed different correlations with the variables, indicating that wine type would add some variance in a comprehensive model. Residual sugar and pH value seem to be unrelated to the perceived quality of a wine.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

There are some relationships with medium to high level of correlation in the data set:

  • density and fixed acidity (0.466)

  • total sulfur dioxide and volatile acidity (-0.415)

  • pH and citric acid (-0.33)

  • density and residual sugar (0.539)

  • sulphates and chlorides (0.396)

  • total sulfur dioxide and free sulfur dioxide (0.721)

  • alcohol and density (-0.701)

It is questionable whether these relationships are of high interest; it probably depends on the interest group studying the topic. For the author’s aim of determining the factors that influence the quality of a wine, these relationships, despite showing higher levels of correlation, are not of major importance.

What was the strongest relationship you found?

The highest correlation that can be observed with regard to quality is that of alcohol (corr 0.444). Among the other variables, a high correlation exists between total sulfur dioxide and free sulfur dioxide (0.721).

Multivariate Plots Section

Let’s first have a look at the highest correlation levels with quality and subset the data by type.

This chart paints a very interesting picture. Red and white wine follow similar trends and also show similar levels of correlation (white: 0.44, red: 0.48). Up to a medium level of quality (rating 5), the alcohol content does not seem to be of great importance, but the better the wine gets, the higher the alcohol level. Let’s see how density varies depending on type and quality.

Interestingly, the quality of red wine shows a much weaker correlation (-0.175) with density. The chart also seems to indicate the reason for the weaker correlation, since the density range of white wine is much wider. After narrowing the density range of white wine to between 0.990 and 1.005, the correlation weakens to -0.258. The chart further shows that red wine has a rather constant level of density (or only very small changes), independent of the level of quality. Nevertheless, the influence of the wine type is small. Let’s have a look at chlorides.
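The effect of narrowing the density range on the correlation can be checked by recomputing cor() on the restricted rows. The sketch below uses the range limits 0.990 and 1.005 mentioned above; the data frame df_white holds made-up values, so the coefficients are illustrative only.

```r
# Made-up white wine rows; the last two densities lie outside the narrow range.
df_white <- data.frame(
  density = c(0.991, 0.993, 0.995, 0.998, 1.0103, 1.039),
  quality = c(7, 6, 6, 5, 6, 6)
)

# Correlation over the full density range.
r_full <- cor(df_white$density, df_white$quality)

# Correlation after restricting density to [0.990, 1.005].
narrow <- subset(df_white, density >= 0.990 & density <= 1.005)
r_narrow <- cor(narrow$density, narrow$quality)

c(full = r_full, narrow = r_narrow)
```

Comparing both coefficients shows how much a few extreme densities contribute to the overall correlation.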

We can observe in the plot that wines with higher quality tend to have lower levels of chlorides. This is somewhat more visible for white wine than for red. Nevertheless, the relationship seems to be rather weak, with a lot of variance within the variables.

Differences Among Wine Types

Interestingly, subsetting the data by wine type reveals different insights. The quality of white wine seems to be related mainly to volatile acidity (corr -0.195), chlorides (corr -0.21), density (corr -0.307) and alcohol (corr 0.436). The quality of red wine also seems to be related to volatile acidity (corr -0.391), alcohol (corr 0.476) and density (corr -0.175), but in addition to citric acid (corr 0.226), total sulfur dioxide (corr -0.185) and sulphates (corr 0.251).

This chart shows the relationship between quality and citric acid, separated by wine type. For white wine, there is hardly any movement visible and many outliers can be observed. For red wine, the chart reveals that wines of higher quality tend to have higher levels of citric acid. Let’s have a look at sulphates.

This chart also highlights how the relationship between quality and sulphates differs between the wine types. While red wine tends to have higher levels of sulphates in higher quality wines, this does not seem to be the case for white wine.

Feature Variables Multivariate Analysis

Having analyzed the differences and relationships regarding quality by wine type, let’s have a closer look at the relationship of density with residual sugar, as well as sulphates and alcohol.

The first plot reveals that wines with lower levels of residual sugar and higher levels of alcohol tend to have a lower density. In other words, wines with higher density tend to have high levels of residual sugar and lower levels of alcohol.

In the case of red wine, the chart indicates that higher levels of sulphates and alcohol tend to go along with higher quality ratings, while this relation is less obvious for white wine. For white wine, the perceived quality seems to increase with higher alcohol content, but not necessarily with higher levels of sulphates.

Inferential statistics

## 
## Calls:
## m1: lm(formula = I(quality ~ I(alcohol)), data = df_wine)
## m2: lm(formula = quality ~ I(alcohol) + density, data = df_wine)
## m3: lm(formula = quality ~ I(alcohol) + density + chlorides, data = df_wine)
## m4: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity, 
##     data = df_wine)
## m5: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity + 
##     type, data = df_wine)
## m6: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity + 
##     type + fixed.acidity, data = df_wine)
## m7: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity + 
##     type + fixed.acidity + pH, data = df_wine)
## m8: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity + 
##     type + fixed.acidity + pH + residual.sugar, data = df_wine)
## m9: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity + 
##     type + fixed.acidity + pH + residual.sugar + citric.acid, 
##     data = df_wine)
## m10: lm(formula = quality ~ I(alcohol) + density + chlorides + volatile.acidity + 
##     type + fixed.acidity + pH + residual.sugar + citric.acid + 
##     sulphates, data = df_wine)
## 
## ================================================================================================================================================================
##                          m1            m2            m3            m4            m5            m6            m7            m8            m9           m10       
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)           2.405***      2.494        -8.224       -37.526***    -27.468***    -37.649***    -38.263***     93.917***     93.524***    134.476***  
##                        (0.086)       (4.678)       (4.825)       (4.801)       (5.260)       (5.739)       (5.796)      (14.833)      (14.894)      (15.383)    
##   I(alcohol)            0.325***      0.325***      0.325***      0.385***      0.365***      0.380***      0.382***      0.244***      0.245***      0.193***  
##                        (0.008)       (0.011)       (0.011)       (0.011)       (0.012)       (0.012)       (0.013)       (0.019)       (0.019)       (0.020)    
##   density                            -0.088        10.827*       40.034***     30.345***     40.764***     41.579***    -92.695***    -92.299***   -133.863***  
##                                      (4.618)       (4.773)       (4.752)       (5.184)       (5.691)       (5.793)      (15.031)      (15.092)      (15.590)    
##   chlorides                                        -2.491***     -0.256        -0.792*       -0.710*       -0.737*       -0.131        -0.114        -0.675*    
##                                                    (0.296)       (0.300)       (0.321)       (0.321)       (0.323)       (0.327)       (0.333)       (0.335)    
##   volatile.acidity                                               -1.478***     -1.662***     -1.717***     -1.718***     -1.683***     -1.691***     -1.579***  
##                                                                  (0.063)       (0.075)       (0.076)       (0.076)       (0.075)       (0.080)       (0.080)    
##   type: white/red                                                              -0.157***     -0.199***     -0.211***     -0.513***     -0.510***     -0.494***  
##                                                                                (0.034)       (0.035)       (0.038)       (0.049)       (0.050)       (0.050)    
##   fixed.acidity                                                                              -0.040***     -0.044***      0.073***      0.074***      0.103***  
##                                                                                              (0.009)       (0.011)       (0.016)       (0.016)       (0.017)    
##   pH                                                                                                       -0.055         0.516***      0.513***      0.591***  
##                                                                                                            (0.073)       (0.093)       (0.093)       (0.093)    
##   residual.sugar                                                                                                          0.058***      0.058***      0.074***  
##                                                                                                                          (0.006)       (0.006)       (0.006)    
##   citric.acid                                                                                                                          -0.024        -0.071     
##                                                                                                                                        (0.080)       (0.079)    
##   sulphates                                                                                                                                           0.740***  
##                                                                                                                                                      (0.076)    
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared             0.197         0.197         0.206         0.268         0.270         0.272         0.272         0.283         0.283         0.293     
##   adj. R-squared        0.197         0.197         0.206         0.267         0.269         0.272         0.271         0.282         0.282         0.292     
##   sigma                 0.782         0.782         0.778         0.748         0.746         0.745         0.745         0.740         0.740         0.735     
##   F                  1597.432       798.593       561.649       592.939       480.165       404.510       346.780       319.444       283.920       268.520     
##   p                     0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -7622.694     -7622.694     -7587.547     -7325.456     -7314.686     -7304.980     -7304.697     -7258.221     -7258.178     -7211.691     
##   Deviance           3975.690      3975.689      3932.899      3628.008      3615.999      3605.209      3604.895      3553.680      3553.632      3503.134     
##   AIC               15251.389     15253.388     15185.093     14662.911     14643.373     14625.960     14627.394     14536.443     14538.356     14447.383     
##   BIC               15271.726     15280.504     15218.988     14703.585     14690.825     14680.192     14688.404     14604.232     14612.924     14528.730     
##   N                  6496          6496          6496          6496          6496          6496          6496          6496          6496          6496         
## ================================================================================================================================================================

The first predictive model shows that the highest R-squared value achieved was 0.293, using 10 variables. This is a rather imprecise model given the number of variables it contains. Let’s try to reduce its complexity while keeping the precision (even though it is at a low level).
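One way to see how little the ten predictors cost at this sample size is the standard adjusted R-squared formula. The Python sketch below is not part of the original R analysis; the function name `adjusted_r_squared` is an illustrative assumption, and the inputs are the rounded values printed in the last column of the table above, so the result may differ from the table in the last digit.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a linear model that includes an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Rounded values read from the last column of the regression table.
adj = adjusted_r_squared(r_squared=0.293, n_obs=6496, n_predictors=10)
print(round(adj, 3))
```

With 6496 observations the penalty for ten predictors is only about 0.001, which is why the R-squared and adj. R-squared rows are nearly identical in every column.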

## 
## Calls:
## m11: lm(formula = I(quality ~ I(alcohol)), data = df_wine)
## m12: lm(formula = quality ~ I(alcohol) + chlorides, data = df_wine)
## m13: lm(formula = quality ~ I(alcohol) + chlorides + volatile.acidity, 
##     data = df_wine)
## m14: lm(formula = quality ~ I(alcohol) + chlorides + volatile.acidity + 
##     type, data = df_wine)
## m15: lm(formula = quality ~ I(alcohol) + chlorides + volatile.acidity + 
##     type + residual.sugar, data = df_wine)
## 
## ==========================================================================================
##                         m11           m12           m13           m14           m15       
## ------------------------------------------------------------------------------------------
##   (Intercept)           2.405***      2.717***      2.911***      3.320***      2.883***  
##                        (0.086)       (0.094)       (0.091)       (0.105)       (0.113)    
##   I(alcohol)            0.325***      0.308***      0.319***      0.313***      0.349***  
##                        (0.008)       (0.008)       (0.008)       (0.008)       (0.009)    
##   chlorides                          -2.308***      0.161        -0.798*       -0.601     
##                                      (0.285)       (0.298)       (0.322)       (0.320)    
##   volatile.acidity                                 -1.337***     -1.667***     -1.690***  
##                                                    (0.061)       (0.075)       (0.074)    
##   type: white/red                                                -0.236***     -0.326***  
##                                                                  (0.031)       (0.032)    
##   residual.sugar                                                                0.023***  
##                                                                                (0.002)    
## ------------------------------------------------------------------------------------------
##   R-squared             0.197         0.205         0.260         0.266         0.277     
##   adj. R-squared        0.197         0.205         0.259         0.266         0.277     
##   sigma                 0.782         0.779         0.752         0.748         0.743     
##   F                  1597.432       839.366       758.756       588.622       498.486     
##   p                     0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -7622.694     -7590.120     -7360.771     -7331.792     -7281.392     
##   Deviance           3975.690      3936.016      3667.670      3635.092      3579.121     
##   AIC               15251.389     15188.239     14731.541     14675.583     14576.783     
##   BIC               15271.726     15215.355     14765.436     14716.257     14624.236     
##   N                  6496          6496          6496          6496          6496         
## ==========================================================================================

The second predictive model reached a highest R-squared value of 0.277, which is not much worse than the first one (R-squared: 0.293). The second model takes only 5 variables as input and is therefore much simpler than the first one without losing much precision.
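The AIC and BIC rows of both tables offer a second way to compare the two models, since they penalize the number of parameters explicitly. The Python sketch below is not part of the original R analysis. It recomputes both criteria from the printed log-likelihoods, assuming the parameter count is the number of predictors plus the intercept plus the residual variance (an assumption that reproduces the tabulated values to within rounding). The function names are illustrative.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k * ln(N) - 2 * log-likelihood."""
    return n_params * math.log(n_obs) - 2 * log_lik


N_OBS = 6496
# Log-likelihoods copied from the tables; k = predictors + intercept + variance.
models = {
    "10 variables": {"log_lik": -7211.691, "k": 10 + 2},
    "5 variables (m15)": {"log_lik": -7281.392, "k": 5 + 2},
}

for name, m in models.items():
    print(name,
          round(aic(m["log_lik"], m["k"]), 3),
          round(bic(m["log_lik"], m["k"], N_OBS), 3))
```

Lower values are better for both criteria. In the tables the 10-variable model has the lower AIC (14447.383 against 14576.783) and the lower BIC (14528.730 against 14624.236), so the simpler model does give up some fit. Whether that loss is acceptable in exchange for five fewer inputs remains a judgment call.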

## 
##  Shapiro-Wilk normality test
## 
## data:  df_white$quality
## W = 0.88904, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  df_red$quality
## W = 0.85759, p-value < 2.2e-16
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  df_wine$quality by df_wine$type
## W = 3311000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##      red    white 
## 5.636023 5.877884

Since the Shapiro-Wilk normality test revealed that the quality ratings of neither wine type are normally distributed, the quality of red and white wine was compared with the Wilcoxon rank sum test, which does not assume normality. It revealed a significant difference between the quality ratings of red and white wine. Together with the group means (red: 5.64, white: 5.88), this suggests that white wine tends to receive higher quality ratings on average than red wine.
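To make the rank-based comparison more concrete, here is a minimal Python sketch of a two-sided rank sum test using the normal approximation with tie and continuity corrections. It is not the author's R code, the function names are illustrative, and the two rating lists are hypothetical toy data rather than values from the wine data set.

```python
import math
from statistics import NormalDist


def midranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (W, two-sided p-value) via the normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    w = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: sum of t^3 - t over groups of equal values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    diff = w - n1 * n2 / 2
    # Continuity correction of 0.5 towards zero.
    correction = 0.5 if diff > 0 else -0.5 if diff < 0 else 0.0
    z = (diff - correction) / math.sqrt(variance)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, p_value


# Hypothetical toy ratings, for illustration only.
red = [5, 5, 6, 5, 6, 4, 5, 6]
white = [6, 6, 5, 7, 6, 6, 7, 5]
w_stat, p = rank_sum_test(red, white)
print(w_stat, round(p, 4))
```

Because quality scores are integers, ties are very common, so the tie correction matters here. The sketch does not handle the degenerate case in which all values are identical.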

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

This section tried to dig deeper into the commonalities and differences of red and white wine. It was interesting to see the different variables influencing quality separated by wine type. In particular, it is interesting to see how the quality of red and white wine is influenced by alcohol, which seems to be the only variable that clearly drives the quality of wine. All other variables (density, chlorides, citric acid and residual sugar) seem to have a weaker influence on quality, which is consistent with the low overall correlation levels found in the previous section. An interesting observation was made by plotting sulphates against alcohol, faceting by type and coloring the points by quality score. It seems that red wines with higher levels of alcohol and sulphates receive higher quality scores, whereas this relationship is not clearly visible for white wine.

Were there any interesting or surprising interactions between features?

There is an interesting relationship between residual sugar, alcohol and density. Denser wines with lower alcohol levels tend to have higher levels of residual sugar. But even in this relationship a clear conclusion or pattern is hard to identify.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

The author tried to create a model to predict the quality of a wine based on several inputs. Overall, the model is not very precise, as the R-squared value peaks at 0.293. This reflects the low levels of correlation found in the data set. Consequently, predicting the quality of a wine with the existing variables is rather hard.
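The sigma row of the regression tables expresses the same limitation in units of the quality score. As a hedged illustration (Python rather than the author's R, with an illustrative function name), the sketch below recomputes the residual standard error from the printed deviance, which for a linear model is the residual sum of squares.

```python
import math


def residual_std_error(deviance, n_obs, n_predictors):
    """sqrt(RSS / residual degrees of freedom) for a model with an intercept."""
    return math.sqrt(deviance / (n_obs - n_predictors - 1))


# Deviance values copied from the last columns of the two regression tables.
print(round(residual_std_error(3503.134, 6496, 10), 3))  # 10-variable model
print(round(residual_std_error(3579.121, 6496, 5), 3))   # 5-variable model m15
```

Both models have a sigma of roughly 0.74, so their typical prediction error is about three quarters of a quality point.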

The best model in terms of the trade-off between the number of variables and effectiveness can be built with the variables alcohol, chlorides, volatile acidity, type and residual sugar.

The author further tested whether the difference in average quality between the wine types is significant. A first test indicated that the quality variable is not normally distributed, and therefore a Wilcoxon rank sum test was applied. The result shows that white wine receives on average better quality ratings than red wine.


Final Plots and Summary

Plot One

Description One

This first plot shows the distribution of quality ratings split by red and white wine. One can see that both wine types are rated in a similar manner. The interquartile range is 1 for both wine types, showing that most quality ratings are between 5 and 6. The lowest rating received was 3 for both wine types, while white wine received a maximum rating of 9 compared with a maximum of 8 for red wine. Although the plot shows similar figures, one can see that the average rating of white wine is higher than that of red wine. The Shapiro-Wilk test revealed that the ratings of both wine types are not normally distributed. Furthermore, the Wilcoxon rank sum test indicated that there is a statistically significant difference between the quality ratings of red and white wine.

Plot Two

Description Two

Correlation tests indicated that none of the 3 data sets contains a very high correlation (+/- 0.75) with the quality variable. The highest correlation was observed with alcohol. This relationship is also visible in the second plot, which shows that the perceived quality of wines, independently of their type, tends to be better with higher alcohol content. It also visualizes why similar levels of correlation can be found among the different data sets (red: 0.436, white: 0.476).
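For readers who want to see what the correlation coefficient measures, the following Python sketch computes Pearson's r from its definition. It is an illustration only and not the author's R code: the function name is an assumption, and the alcohol and quality values are hypothetical toy numbers, not rows from the wine data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical toy values, for illustration only.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0, 12.8]
quality = [5, 5, 6, 5, 6, 7]
print(round(pearson_r(alcohol, quality), 3))
```

A value near +1 or -1 indicates a strong linear relationship. Values around 0.4 to 0.5, as reported for alcohol and quality, indicate a moderate one.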

Plot Three

Description Three

Plot 3 reveals interesting insights into the relationship between alcohol content and sulphates, colored by wine quality and separated by wine type. In the case of red wine, it indicates that higher levels of sulphates and alcohol tend to achieve higher quality ratings, while this relationship is less obvious for white wine. For white wine, perceived quality seems to increase with higher alcohol content, but not necessarily with higher levels of sulphates.


Reflection

The EDA task was a very interesting project for the author, who tried to gain insights into the chemical substances of wine and their influence on quality. Although the author was very excited at the beginning of the project, he had to deal with some obstacles and disappointments. It was nice to gain an understanding of the single variables of the data set, but as soon as the variables were plotted against each other, it was rather disappointing to see low levels of correlation, especially regarding quality. At that point the author assumed that it would not be easy to create a good predictive model for the quality of wine. Nevertheless, it was an interesting overall project in which the author could apply his new skills to a data set with only little guidance from the template. On the other hand, this freedom makes it hard to decide when the analysis has reached its end point, as the analysis could go on forever. The author found it helpful to have the clear goal of primarily understanding the influence on quality, rather than the influence of the chemical substances on each other.

This aspect could certainly offer possibilities for future research. On the other hand, the analysis shows that we need other data or metrics to understand how the quality of wine is perceived by an individual in order to build better models. It might also be interesting to add the kind of grapes or the level of wine experience of the individual testers.

Overall, the project was a great experience.

References

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib